Web page classification using spatial information

نویسندگان

Miloš Kovacevic

Michelangelo Diligenti

Marco Gori

Marco Maggini

Veljko Milutinovic

چکیده

Extracting and processing information from web pages is an important task in many areas like constructing search engines, information retrieval, and data mining from the Web. Common approach in the extraction process is to represent a page as a “bag of words” and then to perform additional processing on such a flat representation. In this paper we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object in a page. Such spatial information allows the definition of heuristics for recognition of common page areas such as header, left and right menu, footer and center of a page. We show a preliminary experiment where our heuristics are able to correctly recognize objects in 73% of cases. Finally, we show that a Naive Bayes classifier, taking into account the proposed representation, clearly outperforms the same classifier using only information about the content of documents.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification

In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...

متن کامل

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. In our extraction algorithm, we inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

متن کامل

Visual Adjacency Multigraphs – a Novel Approach for a Web Page Classification

Standard techniques for a web page classification usually take a simple text-based approach, in which most of the information provided by the visual layout of a page is discarded. In our work we propose a new classification approach based on the visual layout analyses, conducted before implementing standard classification techniques. A page is represented as a hierarchical structure – Visual Ad...

متن کامل

Web Page Structure Enhanced Feature Selection for Classification of Web Pages

Web page classification is achieved using text classification techniques. Web page classification is different from traditional text classification due to additional information, provided by web page structure which provides much information on content importance. HTML tags provide visual web page representation and can be considered a parameter to highlight content importance. Textual keywords...

متن کامل

An Improved Optimized Web Page Classification using Firefly Algorithm with NB Classifier (WPCNB)

The web is a huge repository of information which needs for accurate automated classifiers for Web pages to maintain Web directories and to increase search engines‟ performance. In web page classification problem each term in each HTML/XML tag of each Web page can be taken as a feature, an efficient methods to select best features to reduce feature space of the Web page classification problem d...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2002

Web page classification using spatial information

نویسندگان

چکیده

منابع مشابه

A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification

Data Extraction using Content-Based Handles

Visual Adjacency Multigraphs – a Novel Approach for a Web Page Classification

Web Page Structure Enhanced Feature Selection for Classification of Web Pages

An Improved Optimized Web Page Classification using Firefly Algorithm with NB Classifier (WPCNB)

عنوان ژورنال:

اشتراک گذاری